Efficient Knowledge Distillation from an Ensemble of Teachers
Authors
Abstract
This paper describes the effectiveness of knowledge distillation using teacher-student training for building accurate and compact neural networks. We show that with knowledge distillation, information from multiple acoustic models, such as very deep VGG networks and Long Short-Term Memory (LSTM) models, can be used to train standard convolutional neural network (CNN) acoustic models for a variety of systems requiring a quick turnaround. We examine two strategies to leverage multiple teacher labels for training student models. In the first technique, the weights of the student model are updated by switching teacher labels at the minibatch level. In the second method, student models are trained on multiple streams of information from various teacher distributions via data augmentation. We show that standard CNN acoustic models can achieve comparable recognition accuracy with a much smaller number of model parameters than the teacher VGG and LSTM acoustic models. Additionally, we investigate the effectiveness of using broadband teacher labels as privileged knowledge for training better narrowband acoustic models within this framework. We show the benefit of this simple technique by training narrowband student models with broadband teacher soft labels on the Aurora 4 task.
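The following is a minimal PyTorch sketch, for illustration only, of the two multi-teacher strategies described in the abstract: switching the teacher providing soft labels at the minibatch level, and replicating each minibatch across the teacher distributions as a form of data augmentation. The model classes, data loader, temperature value, and function names are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of multi-teacher knowledge distillation.
# "student" and "teachers" are assumed to be pre-built acoustic models
# (e.g., a CNN student and VGG/LSTM teachers); "loader" yields feature batches.
import random
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross entropy between teacher soft labels and student predictions."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

def train_with_teacher_switching(student, teachers, loader, optimizer):
    """Strategy 1: pick one teacher per minibatch and follow its soft labels."""
    student.train()
    for features, _ in loader:
        teacher = random.choice(teachers)        # switch teacher at the minibatch level
        with torch.no_grad():
            teacher_logits = teacher(features)   # soft labels from the chosen teacher
        loss = distillation_loss(student(features), teacher_logits)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def train_with_augmented_streams(student, teachers, loader, optimizer):
    """Strategy 2: present each minibatch once per teacher distribution
    (data augmentation) and average the distillation losses."""
    student.train()
    for features, _ in loader:
        losses = []
        for teacher in teachers:
            with torch.no_grad():
                teacher_logits = teacher(features)
            losses.append(distillation_loss(student(features), teacher_logits))
        loss = torch.stack(losses).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The same distillation loss could, under the broadband/narrowband setting mentioned above, take soft labels from a broadband teacher while the student consumes narrowband features; that pairing is an assumption about how the privileged-knowledge setup might look, not a description of the authors' exact recipe.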
Similar Resources
Born Again Neural Networks
Knowledge distillation techniques seek to transfer knowledge acquired by a learned teacher model to a new student model. In prior work, the teacher is typically a high-capacity model with formidable performance, while the student is more compact. By transferring knowledge, one hopes to benefit from the student's compactness while suffering only minimal degradation in performance. In this paper,...
Qubit and Entanglement assisted Optimal Entanglement Concentration
We present two methods for optimal entanglement concentration from pure entangled states by local actions only; however, prior knowledge of the Schmidt coefficients is required. The first method is optimally efficient only when a finite ensemble of pure entangled states is available, whereas the second method realizes the single-pair optimal concentration probability. We also propose an entang...
On-line Learning of an Unlearnable True Teacher through Mobile Ensemble Teachers
On-line learning of a hierarchical learning model is studied by a method from statistical mechanics. In our model, a student simple perceptron learns not from the true teacher directly, but from ensemble teachers who learn from the true teacher with a perceptron learning rule. Since the true teacher and the ensemble teachers are expressed as a non-monotonic perceptron and simple perceptrons, respectively, ...
The Impact of Collegial Instruction on Peers’ Pedagogical Knowledge (PK): An EFL Case Study
Shared responsibilities such as mentoring, instruction, learner monitoring, and classroom management enable peers to observe, review, reflect on, and learn from the overall practical professional expertise of one another through the collegial instruction experience. The present exploratory case study has attempted to study collegial teaching as an innovative...
Ensemble Distillation for Neural Machine Translation
Knowledge distillation describes a method for training a student network to perform better by learning from a stronger teacher network. In this work, we run experiments with different kinds of teacher networks to enhance the translation performance of a student Neural Machine Translation (NMT) network. We demonstrate techniques based on an ensemble and a best BLEU teacher network. We also show ...